Unstandardized Accounting Terminology

Dictionaries

This page provides comprehensive dictionaries of accounting terminology constructed using both top-down and bottom-up approaches. These dictionaries enable researchers to identify and measure terminological variation in financial reporting practice.


Overview

We construct alternative accounting dictionaries to capture the universe of accounting terminology and measure standardization in financial reporting. Each dictionary consists of two components:

  • Term Lists: Unique accounting terms that appear in financial reports
  • Concept Lists: Granular mappings showing which terms are used to describe the same underlying accounting concepts (i.e., synonyms)

Construction Approaches

Top-Down (Authoritative Sources): We collect terms from IFRS, US GAAP, and UK GAAP standards, plus specialized accounting dictionaries and the EU’s IATE database. Terms explicitly classified as synonyms are grouped by their underlying concepts. All lists are refined using GPT-based validation and manual checks, then restricted to terminology actually observed in our global corpus of financial reports.

Bottom-Up (XBRL Filings): We extract terms directly from financial statements by parsing XBRL filings on EDGAR. Specifically, we use Exhibit 101.LAB files, which map XBRL taxonomy tags to the natural language labels firms actually use in their reports. This captures real-world variation in reporting practice. To reduce noise, we require that a term appear in at least 20 distinct filings for 10-K terms and in at least 5 distinct filings for 20-F terms. We then apply a majority disambiguation rule that removes terms appearing in fewer than 5% of the filings for a given concept.
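The two bottom-up filters can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to build the dictionaries: the function name `filter_terms`, the `(filing_id, concept, term)` record layout, and the reading of the 5% rule as "share of the filings reporting that concept" are all assumptions made for the example.

```python
from collections import defaultdict


def filter_terms(records, min_filings=20, min_share=0.05):
    """Apply a filing-count filter and a per-concept share filter.

    records: iterable of (filing_id, concept, term) tuples.
    This layout is hypothetical; it stands in for parsed 101.LAB labels.
    """
    term_filings = defaultdict(set)     # term -> filings that use the term
    pair_filings = defaultdict(set)     # (concept, term) -> filings with that pairing
    concept_filings = defaultdict(set)  # concept -> filings that report the concept

    for filing_id, concept, term in records:
        term_filings[term].add(filing_id)
        pair_filings[(concept, term)].add(filing_id)
        concept_filings[concept].add(filing_id)

    kept = []
    for (concept, term), filings in pair_filings.items():
        # Noise filter: the term must occur in enough distinct filings.
        if len(term_filings[term]) < min_filings:
            continue
        # Share filter (assumed denominator: filings reporting this concept).
        share = len(filings) / len(concept_filings[concept])
        if share < min_share:
            continue
        kept.append((concept, term))
    return sorted(kept)
```

Under these assumptions, the 10-K list would use `min_filings=20` and the 20-F list `min_filings=5`, with `min_share=0.05` in both cases.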

Key Characteristics

The table below summarizes the differences across our dictionaries:

Characteristic        | Top-Down                 | Bottom-Up (10-K)          | Bottom-Up (20-F)
Source                | Standards & Dictionaries | ~50,000 U.S. 10-K filings | 20-F filings (IFRS)
Avg. synonyms/concept | 5.4                      | 12.0                      | -
Textual similarity    | 0.78                     | 0.86                      | -
Concentration         | 0.58                     | 0.54                      | -

Interpretation: The bottom-up approach captures more variation per concept, including differences in writing conventions and stylistic choices beyond “true” synonyms. Its higher textual similarity (0.86 vs. 0.78) arises because XBRL-based terms often share structure inherited from taxonomy labels, while its lower concentration (0.54 vs. 0.58) indicates that term usage is spread more evenly across terms rather than dominated by a few common ones.
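To make two of these characteristics concrete, the sketch below computes an average number of terms per concept and a concentration score from toy data. The exact formulas behind the table are not stated on this page, so both are assumptions: synonyms per concept is taken as the count of distinct terms mapped to each concept, and concentration is illustrated with a Herfindahl-style sum of squared usage shares. The function names are hypothetical.

```python
from collections import defaultdict


def average_synonyms_per_concept(pairs):
    """pairs: iterable of (term, concept) tuples (hypothetical layout)."""
    terms_by_concept = defaultdict(set)
    for term, concept in pairs:
        terms_by_concept[concept].add(term)
    # Mean number of distinct terms attached to a concept.
    return sum(len(t) for t in terms_by_concept.values()) / len(terms_by_concept)


def herfindahl(counts):
    """Sum of squared usage shares for the terms of one concept.

    Assumed measure: 1.0 means a single term dominates completely;
    values near 1/n mean usage is spread evenly over n terms.
    """
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)
```

For example, a concept whose two terms are used equally often scores 0.5, while a 90/10 split scores about 0.82, which matches the reading above that lower values mean more evenly distributed usage.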


Term Lists

Term lists provide the complete set of unique accounting terms found in financial reports. These are useful for text analysis, dictionary-based approaches, and understanding the breadth of accounting vocabulary.

Download all term lists: 📥 Excel File (2.6 MB)

Top-Down

Source: IFRS, US GAAP, UK GAAP standards, and specialized accounting dictionaries

Validation: GPT-based refinement and manual validation, restricted to terms observed in our global corpus

Bottom-Up (10-K)

Source: ~50,000 U.S. 10-K XBRL filings (2009-2025)

Filtering: Terms must appear in 20+ distinct filings; validated against US-GAAP taxonomy

Bottom-Up (20-F)

Source: 20-F XBRL filings from non-U.S. firms using IFRS Taxonomy (2009-2025)

Filtering: Terms must appear in 5+ distinct filings; validated against IFRS taxonomy


Concept Lists

Concept lists map individual terms to accounting concepts, revealing which terms are used interchangeably (i.e., as synonyms) to describe the same economic substance. Each row represents a unique term-concept pairing, enabling detailed analysis of terminological variation.

Use cases: Measuring standardization, identifying synonym sets, analyzing cross-border reporting differences, harmonizing accounting data across firms and time periods.

Download all concept lists: 📥 Excel File

Top-Down

Construction: Terms from dictionaries and standards explicitly classified as synonyms are grouped into concepts. Concepts are validated using graph theory (complete graph property) and GPT-based checks to ensure all terms within a concept are truly interchangeable.

Structure: Each row shows a term (TID) and its associated concept (CID), along with the n-gram count.

Bottom-Up (10-K)

Construction: Terms are grouped by XBRL taxonomy tags, where each tag represents a distinct accounting concept. Terms linked to multiple tags are assigned to their primary concept using a majority rule (5% threshold).

Structure: Each row shows which term (TID) maps to which XBRL concept (CID). This reveals how U.S. domestic filers describe the same accounting items using different terminology.

Bottom-Up (20-F)

Construction: Same methodology as 10-K, but using IFRS Taxonomy tags from 20-F filings. Captures how international filers describe accounting concepts.

Structure: Each row shows term-concept mappings based on IFRS Taxonomy, revealing cross-border variation in financial reporting language.
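The complete graph property mentioned for the top-down concept lists can be illustrated as follows. Treat each term as a node and each explicit synonym classification as an edge; a concept passes when every pair of its terms is connected. This is a sketch under that reading, not the validation code itself, and the function names and input layout are hypothetical.

```python
from itertools import combinations


def missing_synonym_pairs(terms, synonym_pairs):
    """Return the term pairs in a concept that lack an explicit synonym link.

    terms: terms assigned to one concept.
    synonym_pairs: iterable of (term_a, term_b) pairs classified as synonyms.
    """
    # Store edges as unordered pairs so (a, b) and (b, a) are equivalent.
    edges = {frozenset(pair) for pair in synonym_pairs}
    return [
        (a, b)
        for a, b in combinations(sorted(set(terms)), 2)
        if frozenset((a, b)) not in edges
    ]


def is_complete_concept(terms, synonym_pairs):
    """True when the terms form a complete graph under the synonym links."""
    return not missing_synonym_pairs(terms, synonym_pairs)
```

A concept where "revenue" is linked to "sales" and "sales" to "turnover", but "revenue" is not linked to "turnover", would fail this check and could be flagged for manual or GPT-based review.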


t-SNE Visualizations

Explore the semantic structure of individual concepts using t-SNE projections. Each plot shows how terms within a concept cluster together in two-dimensional embedding space.

Interpretation: Terms belonging to the same concept should cluster tightly together, while different concepts should be well separated. Tight, well-separated clusters indicate that our concept groupings capture semantically coherent synonym sets.
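The visual impression of tight, well-separated clusters can be complemented with a simple numeric check. The sketch below is not t-SNE and is not part of the materials on this page: it compares the mean cosine similarity of term pairs within the same concept against pairs from different concepts, computed directly on the embedding vectors. The function names and the dictionary-based inputs are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_between_similarity(embeddings, concept_of):
    """Mean cosine similarity for same-concept and cross-concept term pairs.

    embeddings: dict mapping term -> embedding vector (list of floats).
    concept_of: dict mapping term -> concept identifier.
    """
    within, between = [], []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if concept_of[a] == concept_of[b]:
            within.append(sim)
        else:
            between.append(sim)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)
```

A within-concept mean that is clearly higher than the between-concept mean points in the same direction as tightly clustered plots.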


 

© 2025 | Supplementary materials for JAE submission